TASK ASSIGNMENT HEURISTICS FOR PARALLEL AND 
DISTRIBUTED CFD APPLICATIONS

Abstract. This paper proposes a task graph (TG) model to represent a single discrete step
of multi-block overset grid computational fluid dynamics (CFD) applications. The TG model is 
then used to not only balance the computational workload across the overset grids but also to 
reduce inter-grid communication costs. We have developed a set of task assignment heuristics based 
on the constraints inherent in this class of CFD problems. Two basic assignments, the smallest 
task first (STF) and the largest task first (LTF), are first presented. They are then systematically 
enhanced by integrating the status of the processing units and the interprocessor communication
costs. To predict the performance of the proposed task assignment heuristics, extensive performance 
evaluations are conducted on a synthetic TG with tasks defined in terms of the number of grid points 
in predetermined overlapping grids. A TG derived from a realistic problem with eight million grid 
points is also used as a test case.

1. Introduction. The availability of massively parallel computational resources
poses a challenge to the development of efficient algorithms for high-performance sci- 
entific computing. A possible application is the high-fidelity solution of Navier-Stokes 
equations to predict aerodynamic flow characteristics around complex aerospace con- 
figurations. High-end supercomputing is required to reduce the turn-around time of 
such computational fluid dynamics (CFD) simulations. Multiple processors may be 
either tightly coupled or geographically separated and networked into a single virtual 
supercomputer. Seamless access to distributed resources is enabled by metacomput- 
ing toolkits such as Globus [7, 8]. Due to the inherent cost effectiveness of aggregated 
computing, the distributed approach has attracted significant interest and become a 
research priority in recent years. It has also been the main driver behind the devel- 
opment of NASA’s Information Power Grid (IPG) [9]. 

To handle complex geometric configurations, NASA’s CFD production code called 
OVERFLOW [3] decomposes the flow domain into a union of overset structured grids 
(also referred to as zones), each of which covers a relatively simple region of the 
domain. In the parallel implementation, a bin-packing strategy is used to cluster 
individual grids into groups, where the number of groups is equal to the number of 
processors. To avoid poor volumetric load balance, the larger grids can be further 
partitioned into subgrids. As a result, the number of grids and subgrids (collectively 
known as tasks in this paper) easily exceeds the total number of available processors.
Overset grid CFD schemes proceed by computing numerical solutions for each task 
and then updating boundary data across overlapping grids, generating the bulk of 
information transferred between the processors hosting the tasks. Effective task
assignment schemes must therefore not only balance the computations but also reduce
interprocessor communication costs.

Several partitioning schemes for load balancing exist [22], but most are static in 
nature and not suitable for dynamic reconfigurations. The task graph (TG) model 
proposed in this paper is used to represent a discrete step of the overset grid CFD 
simulation process. It is thus able to handle dynamic load balancing requirements by 
modifying the TG from one step to the next as necessary. The model also enables us to 
explore the feasibility of several allocation schemes so that the constraints inherent in 
the underlying applications are observed. The heuristics developed are based on two 
primitive assignments: smallest task first (STF) and largest task first (LTF). These 
assignments are subsequently enhanced by the systematic integration of the status 
of the processing units and the interprocessor communication costs. To evaluate
and compare these proposed heuristics, a synthetic TG is generated where tasks are 
defined in terms of the number of grid points in predetermined overlapping grids. A 
TG derived from a realistic problem with eight million grid points is also used as a 
test case. 

The remainder of this paper is organized as follows. Section 2 gives a brief 
overview of the OVERFLOW CFD code and cites some related work. Section 3 
describes the TG representation of a discrete step in OVERFLOW, while Section 4 
presents our proposed task assignment heuristics. Section 5 explains how synthetic 
TGs are generated for the purpose of evaluating the heuristics. Detailed performance 
results for synthetic and real TGs are presented and discussed in Section 6. Finally, 
Section 7 concludes the paper with a summary and some key observations. 

2. Preliminaries. OVERFLOW [3], NASA’s high-fidelity overset grid CFD production
code, owes its popularity within the aerodynamics community to its
ability to handle complex configurations. These designs typically consist of multiple
geometric components, where individual body-fitted grids can be constructed easily 
about each component. The grids are either attached to the aerodynamics configu- 
ration (near-body) or detached (off-body). The union of all near- and off-body grids 
covers the entire computational domain. In this work, we use a special version of 
OVERFLOW called OVERFLOW-D [16]. 

Both OVERFLOW and OVERFLOW-D use a Reynolds-averaged Navier-Stokes
solver, augmented with a number of turbulence models. However, unlike OVERFLOW 
which is primarily meant for static grid systems, OVERFLOW-D is explicitly designed 
to simplify the modeling of components in relative motion (dynamic grid systems). 
At each time step, the flow equations are solved independently on each zone in a 
sequential manner. Overlapping boundary inter-grid data is updated from previous 
solutions prior to the start of the current time step using a Chimera interpolation 
technique [23]. OVERFLOW-D uses finite differences in space, with a variety of 
implicit/explicit time stepping.

Parallelization of OVERFLOW-D has been developed around its multi-block fea- 
ture which offers a natural coarse-grained parallelism based on the message passing 
programming model. The MPI library is used to communicate the overlapping bound- 
ary data across processes. To facilitate parallel execution, a grouping strategy is used 
to assign each grid to an MPI process; however, parallel efficiency of the overset ap- 
proach depends critically on how this grouping is performed. A number of simple and 
sophisticated grouping strategies for overset applications are discussed in [5]. 

The Chimera interpolation procedure [23] determines the proper connectivity of 
the individual grids. Adjacent grids are expected to have at least a one-cell (single 
fringe) overlap to ensure the continuity of the solutions; for higher-order accuracy and
to retain certain physical features in the solution, a double fringe overlap is sometimes
used. A program named Domain Connectivity Function (DCF) [17] computes the 
inter-grid donor points that have to be supplied to other grids. The DCF procedure 
is incorporated into the OVERFLOW-D code and fully coupled with the flow solver. 
All boundary exchanges are conducted at the beginning of every time step based on 
the interpolatory updates from the previous time step. For dynamic grid systems, 
DCF has to be invoked at every time step to create new inter-grid boundary data. 

In this work, a task graph (TG) is used to represent the interaction between the 
overset grids. Each grid is represented by a node in the TG, and a pair of overlapping 
grids is indicated by an edge between the corresponding nodes. Formally, a TG
$G(V, E)$ consists of a set of vertices $V = \{v_i\}$, $i \geq 1$, to represent tasks, and a set
of edges $E = \{e_i\}$, $i \geq 1$, to represent precedence constraints. If the execution time
of each task is constant, calculating the job completion time is straightforward when 
assuming an unrestricted number of processing units. However, in a networked system, 
a task’s execution time depends on the characteristics of the processor to which it 
is mapped while communication times depend on the latency and bandwidth of the 
interconnect. Thus, estimating and minimizing the total job completion time becomes 
an optimization problem that involves the proper scheduling of tasks to processors. 
Finding an optimal non-preemptive schedule of independent tasks on a two-processor
system is NP-complete [10]. However, linear and polynomial mapping times
can be achieved if the structure of the TG is restricted; such is the case for the 
two-level directed acyclic graphs reported in [15]. To reduce complexity and make 
the procedure feasible for dynamically load balancing OVERFLOW-D TGs on large 
numbers of processors at the risk of obtaining sub-optimal results, novel scheduling 
heuristics are developed and presented in this paper. 

A solution technique for series-parallel TGs is reported in [20, 21] as part of a 
software package called SHARPE. Other related work combines TGs and queueing 
theory; an analytical approach is presented in [1] based on the solution of synchronous 
queueing networks. Also, [24] reports the use of a hierarchical approach that com- 
bines Markov models and TGs. In [14], this combination is applied to the perfor- 
mance prediction of TGs executing in shared-memory multiprocessor environments. 
Stochastic Petri Nets (SPN) are also used to represent parallel programs. In [2], a 
set of translation rules maps language constructs into SPN-based segments that are
then used for automatic translation of parallel programs. Simple SPN-based models 
can also capture precedence constraints and the restrictions imposed by assignment 
heuristics [12, 13]. The use of SPNs to represent TGs extends the analysis to ob- 
tain probability distributions of completion times as well as average execution times 
for different heuristics and computational configurations. Another advantage is that 
system scalability can be predicted as additional processing units become available. 

3. Task Graph Representation. The use of a task graph (TG) to represent 
overset grid CFD simulations makes the performance prediction of several assignment 
heuristics possible. Task allocation heuristics attempt to minimize total job execution 
time based on criteria such as individual task run times and the volume of data 
exchanged. 

Let $Z = \{Z_1, Z_2, \ldots, Z_m\}$ define a collection of $m$ zones in the CFD problem.
A large zone $Z_i$ can be further subdivided into a fixed number of $k_i$ subzones, such
that $Z_i = \{Z_{i1}, Z_{i2}, \ldots, Z_{ik_i}\}$. Note that a partitioned $Z_i$ consists of $k_i > 1$
non-overlapping sets of grid points [6]. We use the term PZ to refer to these partitioned
zones. In the presence of PZs, the heuristics must also consider constraints such
as pre-assignments to different processors to guarantee a parallel execution of the
corresponding tasks. PZs are indicated by the dashed oval in Fig. 3.1 which illustrates
a TG representation of a single iteration of the CFD problem. The set $Z$ is clearly
identified in the TG. It is convenient to map $Z$ into a linear set $T$ of $q$ tasks $T_i$ with no
dependencies such that all tasks become identified with a single index. For example,
the set $T = \{T_1, \{T_2, T_3, T_4\}, \{T_5, T_6, T_7, T_8\}, T_9, \ldots, T_q\}$ identifies a set $Z$ with zones
$Z_1 = T_1$, $Z_2 = \{T_2, T_3, T_4\}$, $Z_3 = \{T_5, T_6, T_7, T_8\}$, $Z_4 = T_9$, and so on, including
$Z_m = T_q$. Note that, in this case, $Z_2$ and $Z_3$ are PZs containing three and four
subzones, respectively. Dummy tasks $D_i$ with zero execution times represent tasks
receiving data needed for the next iteration of the computational process. An iteration
starts and ends at nodes $S$ and $E$, respectively. The arcs represent interactions
between tasks; weights can be associated to these arcs to represent the volume of
data transferred.

Fig. 3.1. Task graph representation of a single iteration of an overset grid CFD application
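
The mapping from zones to a single-index task list can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the function name `flatten_zones`, the nested-list input format, and the sample sizes are assumptions made here, not part of OVERFLOW-D.

```python
def flatten_zones(zones):
    """Map a collection of zones to a linear task list.

    zones: each entry is either a number (an unpartitioned zone) or a
    list of numbers (the subzones of a partitioned zone, PZ).
    Returns (tasks, pz_groups): tasks[i] is the size of the task with
    0-based index i, and pz_groups lists the task indices of each PZ.
    """
    tasks, pz_groups = [], []
    for zone in zones:
        if isinstance(zone, (list, tuple)):
            start = len(tasks)
            tasks.extend(zone)
            pz_groups.append(list(range(start, len(tasks))))
        else:
            tasks.append(zone)
    return tasks, pz_groups


# Hypothetical sizes mirroring the example in the text:
# Z_1 = T_1, Z_2 = {T_2, T_3, T_4}, Z_3 = {T_5, ..., T_8}, Z_4 = T_9.
zones = [50, [20, 20, 20], [15, 15, 15, 15], 40]
tasks, pz_groups = flatten_zones(zones)
```

With 0-based indexing, the two PZs occupy task indices 1 to 3 and 4 to 7; a heuristic can use `pz_groups` to enforce that the members of one PZ land on different processors.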

4. Task Assignment Heuristics. In this section, we describe our task assignment
heuristics. To guarantee parallel execution, all tasks corresponding to PZs
must be pre-assigned to different processors and coordinated to begin execution
simultaneously. This is an important constraint enforced by each heuristic. Once the
pre-assignment of PZs is made, the remaining tasks are allocated based on the
assignment criteria. For $q$ tasks and $n$ processors, several tasks will end up on the same
processor since $q > n$. All tasks are executed by the processors in the order in which
they are assigned into the $n$ groups $G_j$, $j = 1, 2, \ldots, n$.

Let $E(G_j)$ denote the computation time of all the tasks in group $G_j$ (i.e., assigned
to processor $P_j$), and let $X_i$ denote the computation time of task $T_i$. Then,

(4.1)    $E(G_j) = \sum_{T_i \in G_j} X_i$

and the execution time $E$ of a single iteration is

(4.2)    $E = \max_j \{E(G_j)\}$

Assuming data generated by the tasks in $G_j$ is routed in a serial fashion to other
tasks, the transfer cost $C(G_j)$ of all tasks in $G_j$ is

(4.3)    $C(G_j) = \sum_{T_i \in G_j} \sum_k C_{ik}$

where $k$ runs over the destination tasks of $T_i$. $C_{ik}$ is assumed zero if both $T_i$ and
$T_k$ are assigned to the same processor. Instead, if
$T_i$ and $T_k$ are assigned to processors $P_x$ and $P_y$, respectively, it is estimated [4] as

(4.4)    $C_{ik} = (L_{xy} + V_{ik}/B_{xy})\, a_{ix}\, a_{ky}$

where $L_{xy}$ and $B_{xy}$ are the latency and bandwidth between $P_x$ and $P_y$, $V_{ik}$ is the
volume of data generated by $T_i$ with $T_k$ as destination, and $a_{ij}$ is a binary entry of an
assignment matrix $A_{qn}$. The entry is set to one if $T_i$ is assigned to $P_j$; otherwise, it
is zero. An indicator function $I(x)$ that returns unity if the argument $x$ is true is used
to determine the values of $a_{ij}$. The specific form of $I(x)$ depends on the assignment
heuristic. An independent model could also be used to obtain a better estimate of
the communication costs.

By combining Eqs. 4.1 and 4.3, the execution time $E^+$ of a single iteration is
obtained as

(4.5)    $E^+ = \max_j \{E(G_j) + C(G_j)\}$

Note that Eq. 4.5 enhances Eq. 4.2 by including the interprocessor communication 
overhead. Since this scheme separates computation and communication times, assign- 
ment heuristics can be developed based only on task computation times followed by 
estimated communication costs. 
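
To make the separation of computation and communication concrete, the following Python sketch evaluates $E$ and $E^+$ for a given assignment. It rests on several assumptions that are not stated in the text: the transfer cost of a pair is charged to the processor hosting the sending task, $C(G_j)$ is taken as the sum of those pair costs, and a single latency and bandwidth apply to every processor pair. All names and numbers are illustrative.

```python
def transfer_cost(volume, latency, bandwidth):
    """Eq. 4.4 for one pair of tasks placed on different processors."""
    return latency + volume / bandwidth


def iteration_times(x, assign, volumes, latency, bandwidth):
    """Return (E, E_plus) for a single iteration.

    x[i]            computation time of task i
    assign[i]       processor index of task i
    volumes[(i, k)] data volume sent from task i to task k
    latency, bandwidth: uniform link parameters (an assumption; the
    text lets them depend on the processor pair)
    """
    n = max(assign) + 1
    comp = [0.0] * n
    comm = [0.0] * n
    for i, xi in enumerate(x):
        comp[assign[i]] += xi                        # Eq. 4.1
    for (i, k), v in volumes.items():
        if assign[i] != assign[k]:                   # zero cost on the same processor
            comm[assign[i]] += transfer_cost(v, latency, bandwidth)
    e = max(comp)                                    # Eq. 4.2
    e_plus = max(c + m for c, m in zip(comp, comm))  # Eq. 4.5
    return e, e_plus


# Three tasks on two processors; tasks 1 and 2 share processor 1.
x = [4.0, 3.0, 2.0]
assign = [0, 1, 1]
volumes = {(0, 1): 10.0, (1, 2): 10.0}
e, e_plus = iteration_times(x, assign, volumes, latency=0.5, bandwidth=10.0)
```

In this toy case the computation-only estimate is governed by processor 1, while adding the transfer cost makes processor 0 the bottleneck, which is exactly the kind of shift that Eq. 4.5 is meant to expose.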

The following assumptions need to be highlighted at this point: 

• Each CFD iteration is synchronized across processors and cannot commence 
execution until all data generated in the previous iteration is in place. 

• Processors may be idle while communication takes place. They could also be 
idle as a result of the synchronization of iterations. 

• The total application execution time $\mathcal{E}$ is scaled such that $\mathcal{E} = \alpha E$. The term
$\alpha$ is the number of iterations required to complete the entire CFD simulation.

Note that an upper bound of processor idle time $IT$ can be obtained as

(4.6)    $IT = E^+ - \min_j \{E(G_j) + C(G_j)\}$

This is a global measure that could be useful as an objective function to be optimized 
if idle times for each processor were used during the assignment process. 

Another metric that could be used as a measure of the effectiveness of an assignment
is the load imbalance factor $LIF$:

(4.7)    $LIF = \frac{\sum_j (E(G_j) + C(G_j))}{n E^+}$

For heuristics that do not consider communication costs, the term $C(G_j)$ is dropped
and $E^+$ is replaced with $E$ in Eq. 4.7 when calculating $LIF$.
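
Both metrics follow directly from the per-processor totals $E(G_j) + C(G_j)$. The short Python sketch below shows the arithmetic; the function names and the sample totals are chosen here for illustration only.

```python
def idle_time_bound(totals):
    """Eq. 4.6: totals[j] holds E(G_j) + C(G_j) for processor j."""
    return max(totals) - min(totals)


def load_imbalance_factor(totals):
    """Eq. 4.7: average per-processor time divided by the maximum."""
    return sum(totals) / (len(totals) * max(totals))


totals = [90.0, 110.0]          # hypothetical two-processor totals
it = idle_time_bound(totals)
lif = load_imbalance_factor(totals)
```

With these definitions a perfectly balanced assignment gives $IT = 0$ and $LIF = 1$, and $LIF$ decreases toward $1/n$ as the load concentrates on a single processor.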

Let us now describe our proposed task assignment heuristics. The first two are our
basic strategies and depend on whether the smallest or the largest tasks are assigned
first. The next four enhance these basic techniques by incorporating the status of the
processing units in terms of their minimum finish times or largest idle times. Finally,
the last four further integrate the interprocessor communication costs. The overall
relationship among the 10 heuristics is shown in Fig. 4.1. Note that the assignment
matrix $A_{qn}$ is updated each time a task is allocated to a processor using the indicator
function $I(x)$ as specified below for each heuristic.



Fig. 4.1. Overall relationship among our 10 proposed task assignment heuristics


4.1. Smallest Task First (STF). In this scheme, the smallest unassigned task
$T_i$ is allocated to the next processor $P_j$ selected from a list sorted in ascending order
of their index. The assignment is determined such that

(4.8)    $a_{ij} = I(X_i = \min_k \{X_k\})$

where the $X_k$ are the execution times of all the unassigned tasks. In other words, tasks
are assigned in ascending order of their execution time to processors in round-robin
fashion. All tasks corresponding to a PZ are assigned first; in this case, they are
selected randomly within a single PZ set. If clustered processors are properly mapped
into this sorted list, then most PZs are likely to be assigned to clustered processors
and communication costs reduced accordingly.

4.2. Largest Task First (LTF). This heuristic is the inverse of STF and is reported
in [19]. An assignment is determined as

(4.9)    $a_{ij} = I(X_i = \max_k \{X_k\})$

Here, processor $P_j$ is again the next processor from a list sorted in ascending order
by index; however, the tasks are now sorted in descending order of their computation
times. As in STF, all tasks corresponding to a PZ are assigned first. Note that LTF
will give the same assignment as STF because it is merely a maxsort versus minsort
of the tasks.
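
The remark that the two basic schemes coincide can be checked with a small Python sketch of round-robin assignment. PZ pre-assignment is left out, the function names are hypothetical, and the comparison is made on the grouping of tasks only, i.e., ignoring which processor index a group receives and assuming all computation times are distinct.

```python
def round_robin(x, n, largest_first=False):
    """Deal tasks, sorted by computation time, to processors 0..n-1 in turn.

    Returns groups, where groups[j] lists the task indices on processor j.
    PZ pre-assignment is not modeled in this sketch.
    """
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=largest_first)
    groups = [[] for _ in range(n)]
    for pos, i in enumerate(order):
        groups[pos % n].append(i)
    return groups


def as_partition(groups):
    """Forget processor labels and the order of tasks within a group."""
    return {frozenset(g) for g in groups}


x = [50, 40, 30, 60, 10]                 # hypothetical computation times
stf = round_robin(x, 2)                  # smallest task first
ltf = round_robin(x, 2, largest_first=True)
same_grouping = as_partition(stf) == as_partition(ltf)
```

Reversing the sort order shifts every task's position by the same offset modulo $n$, so two tasks share a processor under STF exactly when they share one under LTF; only the processor labels and the execution order within a group change.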

4.3. STF with Minimum Finish Time (STF_MFT). The minimum accumulated
time $Ac_j$ at processor $P_j$ is combined with the STF criteria. Thus, task $T_i$
is assigned to processor $P_j$ such that

(4.10)    $a_{ij} = I(Ac_j = \min_i \{X_i\} + \min_j \{Ac_j\})$

In this scheme, the unassigned task $T_i$ with the minimum computation time $X_i$ is
allocated to the processor that becomes free in the shortest time. Note that STF_MFT
will result in the same assignment as STF since the latter, by design, automatically
allocates the next task to the processor with the minimum finish time. This strategy
is a variation of the heuristic reported in [18].

4.4. LTF with Minimum Finish Time (LTF_MFT). This scheme is a variation
of Eq. 4.10 as the task with the maximum execution time is scheduled instead.
Hence,

(4.11)    $a_{ij} = I(Ac_j = \max_i \{X_i\} + \min_j \{Ac_j\})$

This task assignment heuristic is similar to the bin-packing strategy described in [6].
Note that LTF-based assignment heuristics should generally perform significantly better
than the corresponding STF-based strategies because the largest tasks are allocated
first.
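
A minimal Python sketch of the minimum-finish-time rule is given below, using a heap keyed on the accumulated time $Ac_j$. It ignores PZ pre-assignment and communication costs, breaks ties in favor of the lower processor index (an assumption), and uses four hypothetical task times.

```python
import heapq


def mft_assign(x, n, largest_first=False):
    """Greedy minimum-finish-time assignment (Eqs. 4.10 and 4.11).

    Returns (groups, acc), where groups[j] lists the tasks on processor j
    and acc[j] is its accumulated time Ac_j.
    """
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=largest_first)
    heap = [(0.0, j) for j in range(n)]    # (Ac_j, j); ties go to the lower index
    heapq.heapify(heap)
    groups = [[] for _ in range(n)]
    for i in order:
        acc_j, j = heapq.heappop(heap)     # processor that becomes free first
        groups[j].append(i)
        heapq.heappush(heap, (acc_j + x[i], j))
    acc = [0.0] * n
    for acc_j, j in heap:
        acc[j] = acc_j
    return groups, acc


x = [50, 40, 30, 60]                       # hypothetical computation times
_, stf_acc = mft_assign(x, 2)
_, ltf_acc = mft_assign(x, 2, largest_first=True)
```

On this small input the smallest-first order finishes at 100 time units while the largest-first order finishes at 90, a toy instance of why placing the largest tasks first tends to leave less imbalance at the end.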

4.5. STF with Largest Idle Time (STF_LIT). This scheme is another alternative
that combines STF and the largest idle time of a processor. It is similar to
STF_MFT except that an upper bound of the idle time is determined by the maximum
accumulated time. Let $h_j$ denote the current idle time of processor $P_j$ with
respect to $\max_k \{Ac_k\}$. Then,

(4.12)    $h_j = \max_k \{Ac_k\} - Ac_j$

Thus, initially all $h_j$ evaluate to zero and the first $n$ tasks chosen randomly (assuming
they can execute in parallel) are assigned to the $n$ processors. Thereafter, $h_j$ is selected
from a descending order list and $X_i$ from an ascending order list. The task $T_i$ with
the smallest $X_i$ is assigned to the processor with the largest $h_j$. The assignment of
these remaining tasks is summarized in Fig. 4.2. Note that every time the $h_j$ list is
modified, it is reordered again with an added sorting cost of $O(n \log n)$. The $h_j$ values
change depending on $\Delta_{ij} = \max_j \{h_j\} - \min_i \{X_i\}$.

begin procedure STF_LIT
  1. Compute $h_j$ using Eq. 4.12 and sort in descending order
  2. Select next task $T_i$ such that its $X_i$ is minimum
  3. Assign $T_i$ to processor $P_j$ such that its $h_j$ is maximum
  4. Update $h_j$ accordingly
  5. If ($\Delta_{ij} = \max_j \{h_j\} - \min_i \{X_i\} < 0$), repeat from step 1
  6. Else repeat from step 2
end procedure STF_LIT

Fig. 4.2. Assignment procedure for the STF_LIT heuristic
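
The idle-time computation of Eq. 4.12 is easy to sketch in Python. The example only covers a single selection step with hypothetical accumulated times; it does not reproduce the lazy re-sorting of the full procedure, so complete assignments produced by the two rules may still differ.

```python
def idle_times(acc):
    """Eq. 4.12: idle time of each processor relative to the busiest one."""
    peak = max(acc)
    return [peak - a for a in acc]


def most_idle(acc):
    """Processor with the largest idle time (lowest index on ties)."""
    h = idle_times(acc)
    return max(range(len(h)), key=lambda j: (h[j], -j))


acc = [34.0, 42.0, 40.0]      # hypothetical accumulated times Ac_j
h = idle_times(acc)
target = most_idle(acc)
```

Because the idle time is the peak accumulated time minus $Ac_j$, the processor with the largest idle time at a given step is also the one with the smallest accumulated time, which helps explain why the LIT variants behave much like their MFT counterparts.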

4.6. LTF with Largest Idle Time (LTF_LIT). This scheme is a variation of
STF_LIT except that all computation times are now sorted in descending order.

4.7. STF_MFT with Communication Costs (STF_MFT_CC). We now
incorporate the interprocessor communication costs into the task assignment heuristics.
However, STF_LIT and LTF_LIT are no longer considered because their performance
is similar to STF_MFT and LTF_MFT, respectively. We first ignore the fact
that some destination tasks may not yet be allocated to processors and therefore the
communication cost must be estimated. The STF_MFT_CC strategy assigns non-PZ
tasks with minimum computation times and communication costs to processors with
the current minimum finish time. The assignment matrix is created as

(4.13)    $a_{ij} = I(Ac_j = \min_i \{X_i + C_i\} + \min_j \{Ac_j\})$

Note that the only modification to the STF_MFT indicator function in Eq. 4.10 is that
a task is selected such that the sum of its computation and communication times is
minimum. This value is then added to the accumulated time of the selected processor.
The communication time for task $T_i$ is obtained as $C_i = \sum_k C_{ik}$, where the index $k$
identifies all destination tasks. A drawback of this scheme is that all destination
tasks are required to be already allocated. If this is not the case, the communication
cost is estimated by assuming constant latency $L$ and bandwidth $B$ and then using
Eq. 4.4. Our implementation steps are shown in Fig. 4.3. The adjustments in step 4
are required because no communication costs are incurred if two interacting tasks are
assigned to the same processor.
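
The estimated communication time and the resulting task order can be sketched as follows. This Python fragment covers only the estimate with constant latency and bandwidth and the sort key $X_i + C_i$; it does not attempt the same-processor adjustments. The function names and all numbers are hypothetical.

```python
def estimated_comm_times(volumes, q, latency, bandwidth):
    """C_i as the sum over destinations k of (L + V_ik / B), constant L and B."""
    c = [0.0] * q
    for (i, _k), v in volumes.items():
        c[i] += latency + v / bandwidth
    return c


def cc_order(x, c, largest_first=False):
    """Task indices sorted by X_i + C_i (ascending for the STF-based variant)."""
    return sorted(range(len(x)), key=lambda i: x[i] + c[i], reverse=largest_first)


x = [30.0, 28.0, 35.0]                                   # computation times
volumes = {(0, 1): 40.0, (0, 2): 40.0, (1, 0): 10.0}     # V_ik per directed pair
c = estimated_comm_times(volumes, 3, latency=1.0, bandwidth=10.0)
order_cc = cc_order(x, c)
order_plain = sorted(range(len(x)), key=lambda i: x[i])
```

Here task 0 would be scheduled second when only computation times are considered, but it moves to the end of the list once its two outgoing transfers are included, which is the effect the communication-aware sort key is intended to capture.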

4.8. LTF_MFT with Communication Costs (LTF_MFT_CC). Task assignments
under this scheme are similar to those represented by Eq. 4.13 except that
they are now conducted with respect to the maximum value of $(X_i + C_i)$:

(4.14)    $a_{ij} = I(Ac_j = \max_i \{X_i + C_i\} + \min_j \{Ac_j\})$

begin procedure STF_MFT_CC
  1. Sort tasks $T_i$ in ascending order in terms of $(X_i + C_i)$
  2. For each processor $P_j$, set $Ac_j = 0$
  3. For each task $T_i$ allocated to processor $P_j$
       $Ac_j = Ac_j + X_i + C_i$
  4. For each task $T_i$ allocated to processor $P_j$
     4.1. If a destination task $T_k$ is assigned to the same processor $P_j$, do
            $Ac_j = Ac_j - C_{ik}$
     4.2. If a predecessor task $T_k$ is assigned to the same processor $P_j$, do
            $Ac_j = Ac_j - C_{ki}$
end procedure STF_MFT_CC

Fig. 4.3. Assignment procedure for the STF_MFT_CC heuristic


4.9. STF_MFT with Actual Communication Costs (STF_MFT_ACC).
In this scheme, tasks are allocated to processors according to Eq. 4.10 and communication
costs are calculated only after predecessor and destination tasks are assigned.
Network latency and bandwidth depend on where the interacting tasks are actually
allocated. Adding these costs whenever possible updates the processor accumulated
time, which thereby integrates communication costs at least partially. An outline of the
implementation algorithm is shown in Fig. 4.4.


begin procedure STF_MFT_ACC
  1. Sort tasks $T_i$ in ascending order in terms of $X_i$
  2. For each processor $P_j$, set $Ac_j = 0$
  3. For each task $T_i$ in sorted list
     3.1. Assign task $T_i$ to processor $P_j$ with minimum $Ac_j$
            $Ac_j = Ac_j + X_i$
     3.2. For each task $T_r \in R(T_i)$ assigned to processor $P_k \neq P_j$, do
            $Ac_j = Ac_j + C_{ir}$
     3.3. For each task $T_d \in D(T_i)$ assigned to processor $P_m \neq P_j$, do
            $Ac_m = Ac_m + C_{di}$
end procedure STF_MFT_ACC

Fig. 4.4. Assignment procedure for the STF_MFT_ACC heuristic


As task $T_i$ is assigned, the set of tasks $R(T_i)$ receiving data from $T_i$ but assigned
to a different processor is updated. The accumulated time of the processor hosting
$T_i$ is also adjusted with the communication costs for all tasks in $R(T_i)$. Likewise,
the set of predecessor tasks $D(T_i)$ donating data to $T_i$ but assigned to a different
processor is updated. The accumulated time of all processors hosting tasks in $D(T_i)$
is also updated. To illustrate how STF_MFT_ACC operates, consider the simple TG
in Fig. 4.5 and the execution steps in Table 4.1 when assigning it to a two-processor
system.
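
The procedure can be sketched in Python as follows. The computation times and directed edge costs used in the example are read off the arithmetic of the trace in Table 4.1 rather than from Fig. 4.5 itself, so they should be treated as inferred values; the function name, the 0-based indexing, and the tie-breaking rule (lowest processor index) are assumptions of this sketch.

```python
def mft_acc_assign(x, cost, n, largest_first=False):
    """Sketch of the Fig. 4.4 procedure.

    x[i]          computation time of task i
    cost[(i, k)]  communication cost C_ik charged when task i sends to task k
    Returns (proc, acc): proc[i] is the processor of task i, acc[j] is Ac_j.
    """
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=largest_first)
    acc = [0.0] * n
    proc = {}
    for i in order:
        j = min(range(n), key=lambda p: acc[p])   # step 3.1, lowest index on ties
        proc[i] = j
        acc[j] += x[i]
        for (s, d), c in cost.items():
            if s == i and d in proc and proc[d] != j:
                acc[j] += c                       # step 3.2: receivers of T_i elsewhere
            elif d == i and s in proc and proc[s] != j:
                acc[proc[s]] += c                 # step 3.3: donors to T_i elsewhere
    return proc, acc


# Index 0..3 stands for T_1..T_4; values inferred from the Table 4.1 trace.
x = [50, 40, 30, 60]
cost = {(1, 2): 2, (2, 1): 4, (0, 1): 2, (1, 0): 1,
        (3, 0): 3, (3, 2): 4, (0, 3): 1, (2, 3): 3}
proc, acc = mft_acc_assign(x, cost, 2)
```

Under these inferred inputs the sketch places $T_3$ and $T_1$ on the first processor and $T_2$ and $T_4$ on the second, ending with accumulated times of 90 and 110.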

4.10. LTF_MFT with Actual Communication Costs (LTF_MFT_ACC).
The procedure given in Fig. 4.4 for STF_MFT_ACC applies here except that the allocation
follows Eq. 4.11 for which tasks are sorted in descending order of their computation
times. Table 4.2 shows the execution steps when assigning the TG in Fig. 4.5 to
two processors. Notice that the workload is balanced much better than in Table 4.1.

Fig. 4.5. Task graph of a small overset grid CFD application


Table 4.1
Execution trace of STF_MFT_ACC for the TG in Fig. 4.5

Step   Action
1.     Sorted tasks: $T_3, T_2, T_1, T_4$
2.     $Ac_1 = Ac_2 = 0$
3.1.   $T_3 \to P_1$;  $Ac_1 = 0 + 30 = 30$
3.2.   $R(T_3) = \emptyset$
3.3.   $D(T_3) = \emptyset$
3.1.   $T_2 \to P_2$;  $Ac_2 = 0 + 40 = 40$
3.2.   $R(T_2) = \{T_3\}$;  $Ac_2 = 40 + 2 = 42$
3.3.   $D(T_2) = \{T_3\}$;  $Ac_1 = 30 + 4 = 34$
3.1.   $T_1 \to P_1$;  $Ac_1 = 34 + 50 = 84$
3.2.   $R(T_1) = \{T_2\}$;  $Ac_1 = 84 + 2 = 86$
3.3.   $D(T_1) = \{T_2\}$;  $Ac_2 = 42 + 1 = 43$
3.1.   $T_4 \to P_2$;  $Ac_2 = 43 + 60 = 103$
3.2.   $R(T_4) = \{T_1, T_3\}$;  $Ac_2 = 103 + 3 + 4 = 110$
3.3.   $D(T_4) = \{T_1, T_3\}$;  $Ac_1 = 86 + 1 + 3 = 90$

Table 4.2
Execution trace of LTF_MFT_ACC for the TG in Fig. 4.5

Step   Action
1.     Sorted tasks: $T_4, T_1, T_2, T_3$
2.     $Ac_1 = Ac_2 = 0$
3.1.   $T_4 \to P_1$;  $Ac_1 = 0 + 60 = 60$
3.2.   $R(T_4) = \emptyset$
3.3.   $D(T_4) = \emptyset$
3.1.   $T_1 \to P_2$;  $Ac_2 = 0 + 50 = 50$
3.2.   $R(T_1) = \{T_4\}$;  $Ac_2 = 50 + 1 = 51$
3.3.   $D(T_1) = \{T_4\}$;  $Ac_1 = 60 + 3 = 63$
3.1.   $T_2 \to P_2$;  $Ac_2 = 51 + 40 = 91$
3.2.   $R(T_2) = \{T_4\}$;  $Ac_2 = 91 + 3 = 94$
3.3.   $D(T_2) = \{T_4\}$;  $Ac_1 = 63 + 3 = 66$
3.1.   $T_3 \to P_1$;  $Ac_1 = 66 + 30 = 96$
3.2.   $R(T_3) = \{T_2, T_4\}$;  $Ac_1 = 96 + 4 + 3 = 103$
3.3.   $D(T_3) = \{T_2, T_4\}$;  $Ac_2 = 94 + 2 + 4 = 100$

5. Synthetic Task Graph Generation. All of our proposed heuristics can be
evaluated with synthetic TGs. Given the total number of grid points $ngp$ and the
number of tasks $q$, the synthetic TG generation process consists of three steps. The
first step is specifying the number of grid points for each zone $Z_i$. The procedure
GEN_GRIDS outlined in Fig. 5.1 accomplishes this goal. Basically, each $Z_i$ is initially
allocated a random number of grid points less than $\lfloor ngp/q \rfloor$. If the total number of
grid points for all $q$ zones is less than $ngp$, the remaining grid points are
added to one randomly chosen zone.



begin procedure GEN_GRIDS
  sum = 0
  for (i = 1 to i = q)
    Generate a random number $x \in \{1, ngp/q\}$
    Allocate $x$ grid points to $Z_i$
    sum = sum + x
  end for
  if (sum < ngp)
    Generate a random index $x \in \{1, q\}$
    Allocate (ngp − sum) additional grid points to $Z_x$
  end if
end procedure GEN_GRIDS

Fig. 5.1. Procedure for generating zones containing different numbers of grid points
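
A direct Python transcription of this procedure might look as follows. It reads the random draw as an integer between 1 and $\lfloor ngp/q \rfloor$ inclusive, which is one possible interpretation of the figure, and the optional `rng` argument is an addition made here so that runs can be reproduced.

```python
import random


def gen_grids(ngp, q, rng=None):
    """Return q zone sizes (numbers of grid points) that sum to ngp."""
    rng = rng or random.Random()
    cap = ngp // q                           # upper limit for each initial draw
    sizes = [rng.randint(1, cap) for _ in range(q)]
    remaining = ngp - sum(sizes)
    if remaining > 0:
        sizes[rng.randrange(q)] += remaining  # one random zone absorbs the rest
    return sizes


sizes = gen_grids(16_000_000, 128, random.Random(0))
```

Since the initial draws average roughly half of the per-zone limit, the leftover points handed to a single zone are typically a large share of $ngp$, so this generator tends to produce one zone that is much larger than the others.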

A second procedure called GEN_TG, outlined in Fig. 5.2, generates the topology
of the TG with tasks defined by zones generated by GEN_GRIDS. The number of
overlapping zones defines a window of size $w = r \times q \times o$ where $r$ is a random number
between 0 and 1, and $o$ is a user-supplied parameter that specifies the maximum
percentage of $q$ zones that can overlap. Each zone $Z_i$ may overlap with $w/2$ zones
below and above its index $i$.



Fig. 5.2. Procedure for generating the topology of a synthetic TG 


Finally, a third procedure called COMM_VOL generates the volume of data exchanged
between interacting tasks in terms of the number of overlapping grid points.
The outline of this procedure is shown in Fig. 5.3. This program reads a file (generated
by GEN_GRIDS) containing the number of grid points $X_i$ for each task and a
file (generated by GEN_TG) containing the indices of overlapping tasks. The communication
volume between tasks $T_i$ and $T_j$ is set as $c_{ij} = rc \times X_j$ where $rc$ is the
fraction of grid points that are in the overlap region.

begin procedure COMM_VOL
  for (i = 1 to i = q)
    m = number of overlapping zones for $Z_i$
    for (j = 1 to j = m)
      Obtain index $k$ of $j$-th overlapping zone
      $c_{ik} = rc \times X_k$
    end for
  end for
end procedure COMM_VOL

Fig. 5.3. Procedure for generating communication volume between overlapping zones
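
The volume computation is compact enough to express as a single Python function. In this sketch the two input files are replaced by in-memory lists, and the zone sizes and overlap lists are made up for illustration.

```python
def comm_vol(x, overlaps, rc):
    """Return {(i, k): volume} with volume = rc * x[k] for each overlapping pair.

    x[i]         number of grid points of task i
    overlaps[i]  indices of the tasks that overlap task i
    rc           fraction of grid points in the overlap region
    """
    return {(i, k): rc * x[k] for i, ks in enumerate(overlaps) for k in ks}


x = [1000, 400, 600]               # hypothetical zone sizes
overlaps = [[1], [0, 2], [1]]      # zone 1 overlaps both of its neighbors
vol = comm_vol(x, overlaps, rc=0.5)
```

Because the volume for a pair depends on the size of the second zone, the result is generally asymmetric: in this example the volume from zone 0 to zone 1 is 200 points while the reverse direction is 500.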

6. Performance Results. A simple interface called Evaluate Assignment Heuristic
(EVAH) has been implemented to compare and contrast the various heuristics.
We first generated a synthetic TG containing 128 zones and 16 million grid points. The
other parameters are set as follows: computation time per grid point $xpg = 15$ μsec,
fraction of grid points in overlap region $rc = 0.5$, network latency $L = 13$ μsec, and
network bandwidth $B = 37.3$ Mbytes/sec. Latency and bandwidth are required to
estimate the cost of data exchange between overlapping zones. The communication
volume is calculated in terms of the number of overlapping grid points multiplied by
a factor equal to the number of bytes per grid point required to exchange data. In our
case, this factor is assumed to be 200 bytes. All evaluations are performed assuming
the number of processors to be between 2 and 128.

Two performance metrics are reported: load imbalance factor (LIF) and speedup
($S$). LIF is calculated using Eq. 4.7, while $S$ is computed as $S = (ngp \times xpg)/E^+$,
where $E^+$ is given by Eq. 4.5. Fig. 6.1 shows the LIF for all the heuristics based on
STF, while Fig. 6.2 shows results for those based on LTF. Results for STF_LIT and
LTF_LIT are not presented because they are almost identical to those for STF_MFT
and LTF_MFT, respectively. Both figures show results for the baseline STF and LTF
heuristics for the sake of comparison. While the general trends are similar, improvements
are more significant for the LTF-based heuristics. The best overall results are obtained
with LTF_MFT_CC. This is expected for homogeneous processing systems. For
heterogeneous environments, as in computational grids, the LTF_MFT_ACC heuristic
should do significantly better.
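
The following Python sketch works through the arithmetic behind these parameters. The iteration time of 30 seconds and the block of 10,000 overlap points are hypothetical inputs, and 1 Mbyte is taken as $10^6$ bytes, which is an assumption about the unit.

```python
def speedup(ngp, xpg, e_plus):
    """S = (ngp * xpg) / E+, with xpg and e_plus in the same time unit."""
    return ngp * xpg / e_plus


def exchange_time(points, latency, bandwidth, bytes_per_point=200):
    """Latency plus transfer time, in seconds, for a block of overlap points."""
    return latency + points * bytes_per_point / bandwidth


NGP = 16_000_000
XPG = 15e-6            # seconds per grid point
LATENCY = 13e-6        # seconds
BANDWIDTH = 37.3e6     # bytes per second (1 Mbyte taken as 10**6 bytes)

sequential = NGP * XPG                        # single-processor time per iteration
s = speedup(NGP, XPG, 30.0)                   # hypothetical E+ of 30 seconds
t = exchange_time(10_000, LATENCY, BANDWIDTH)
```

The sequential iteration time comes to about 240 seconds, so an assignment that achieves an iteration time of 30 seconds would correspond to a speedup of about 8. Exchanging 10,000 overlap points costs roughly 0.054 seconds, almost all of it transfer time rather than latency.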

Performance results for speedup $S$ for the STF- and LTF-based heuristics are
shown in Figs. 6.3 and 6.4. Again, the best results are achieved by LTF_MFT_CC.
If the underlying system architecture possesses uniform interprocessor communication
characteristics, the obvious choice is LTF_MFT_CC. Otherwise, one should use
LTF_MFT_ACC which allows only actual communication costs to influence the task
assignment scheme.

Notice that the performance metrics are identical for all heuristics when the number
of processors n is 1 or 128. When n = 1, we are simulating sequential execution
of the TG; hence, task assignment does not have any effect. When n = 128, each
processor executes exactly one zone since our synthetic TG has 128 zones. Thus, all
assignment heuristics return the same result. However, the actual trend between 2
and 127 processors depends on the assignment scheme and the TG.

Fig. 6.1. Load imbalance factors for STF-based heuristics on synthetic TG

Fig. 6.2. Load imbalance factors for LTF-based heuristics on synthetic TG

We now present results obtained using a real test case of eight million grid points
distributed among 41 zones. The plots in Figs. 6.5 and 6.6 present detailed comparisons
of computation and communication times per processor as predicted by the
LTF_MFT_ACC assignment heuristic and those actually computed by OVERFLOW-D
using LTF_MFT_ACC on an SGI Origin3000, for 16 and 32 processors. For the
sake of clarity, plots of computation and communication times are shown at different
scales indicated by the left and right vertical axes, respectively.

Fig. 6.3. Speedups for STF-based heuristics on synthetic TG

Fig. 6.4. Speedups for LTF-based heuristics on synthetic TG

Fig. 6.5. Computation and communication times per processor for 16 processors on TG
obtained from real test case

Fig. 6.6. Computation and communication times per processor for 32 processors on TG
obtained from real test case

The predicted computation times closely match the measured data for both 
n = 16 and n = 32. The predicted communication times are slightly off, espe- 
cially for certain processors in the n = 32 case. These mismatches are primarily due 
to the assumptions in the task assignment heuristic. For example, LTF_MFT_ACC 
assumes certain values for network latency and bandwidth that are then used to cal- 
culate interprocessor communication costs. These parameters are estimated from the 
16-processor run, which explains the bigger discrepancy for n = 32. However, these 
parameters are not constants but change dynamically at run time depending on mes- 
sage size and interconnect topology, features that are not modeled by our assignment 
heuristics. The computation time per grid point, on the other hand, is relatively 
accurate and uniform across processors for a tightly coupled parallel platform such 
as the Origin3000. As a result, the predicted computation times match better with 
actual data. 
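
To illustrate why fixed latency and bandwidth values can mispredict communication times, the sketch below uses the common linear cost model (startup latency plus message size divided by bandwidth). This model is an assumption made here for illustration; the exact cost formula used by LTF_MFT_ACC is the one defined earlier in the paper. All parameter values and the function names `comm_time` and `relative_error` are hypothetical.

```python
def comm_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed startup latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

def relative_error(predicted, actual):
    """Prediction error as a fraction of the actual value."""
    return abs(predicted - actual) / actual

# Hypothetical parameters, fitted once and then reused for every run.
fitted_latency = 10e-6        # seconds
fitted_bandwidth = 200e6      # bytes per second
message = 1_000_000           # bytes

predicted = comm_time(message, fitted_latency, fitted_bandwidth)
# Suppose contention halves the effective bandwidth at run time.
actual = comm_time(message, fitted_latency, fitted_bandwidth / 2)

print(f"predicted {predicted:.6f} s, actual {actual:.6f} s")
print(f"relative error {relative_error(predicted, actual):.1%}")
```

In this toy setting a halved effective bandwidth makes the fixed-parameter prediction roughly half the actual time, which is the kind of mismatch that a static latency and bandwidth estimate cannot capture.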

For applications running on small numbers of processors, such as the ones shown 
in Figs. 6.5 and 6.6, the communication cost is only a small fraction (less than 2%) 
of the computation time. The predicted total execution time is therefore hardly 
affected by the error in estimating the communication time, and matches well with 
the measured data. Also, the larger spread in computation times for n = 32 reflects 
the general difficulty with load balancing as the number of zones approaches the 
number of processors. Results using larger test cases and more processors can be 
found in [5]. 
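
The insensitivity of the total time to communication error can be made explicit with a short bound. The symbols T_comp, T_comm, and delta are introduced here for illustration and are not the paper's notation, and the computation time is assumed to be predicted exactly.

```latex
\[
T = T_{\mathrm{comp}} + T_{\mathrm{comm}}, \qquad
T_{\mathrm{comm}} \le 0.02\, T_{\mathrm{comp}} .
\]
If the predicted communication time is off by a relative error $\delta$,
the relative error in the predicted total time is
\[
\frac{\delta\, T_{\mathrm{comm}}}{T_{\mathrm{comp}} + T_{\mathrm{comm}}}
\;\le\; \delta\, \frac{0.02}{1.02} \;<\; 0.02\, \delta .
\]
```

For example, even a 50% error in the communication estimate (delta = 0.5) changes the predicted total time by less than 1%.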


Table 6.1 
Comparison of various task assignment heuristics for the real TG on 16 processors 

Heuristic     E (Eq. 4.2)   E+ (Eq. 4.5)   IT (Eq. 4.6)   LIF (Eq. 4.7) 
STF           7.464         7.553          3.687          0.729 
STF_MFT       7.464         7.553          3.687          0.729 
STF_LIT       7.464         7.553          3.687          0.729 
STF_MFT_CC    7.464         7.550          3.684          0.729 
STF_MFT_ACC   7.464         7.553          3.687          0.729 
LTF           7.464         7.559          3.696          0.728 
LTF_MFT       5.933         6.006          1.190          0.917 
LTF_LIT       5.933         6.013          1.205          0.916 
LTF_MFT_CC    5.933         6.015          1.205          0.915 
LTF_MFT_ACC   5.933         6.011          1.193          0.916 

Finally, Table 6.1 lists results obtained with all the task assignment heuristics 
using the real TG on 16 processors of an Origin3000. Observe that except for the 
basic LTF heuristic, all the other LTF-based strategies are significantly better than 
the STF-based schemes. This is expected because the largest tasks are allocated first 
in the LTF heuristics. Incorporating the communication cost model has very little 
effect in this case since the communication times are negligible compared to the total 
execution times. 
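
The comparison can be quantified directly from the E and E+ columns of Table 6.1. The short script below only does arithmetic on values copied from the table; the helper name `relative_reduction` is illustrative.

```python
# Values copied from Table 6.1 (columns E and E+).
E = {"STF": 7.464, "LTF": 7.464, "LTF_MFT": 5.933}
E_PLUS = {"LTF_MFT": 6.006, "LTF_LIT": 6.013,
          "LTF_MFT_CC": 6.015, "LTF_MFT_ACC": 6.011}

def relative_reduction(baseline, value):
    """Fractional decrease of value with respect to baseline."""
    return (baseline - value) / baseline

# Gain of the enhanced LTF variants over the STF-based schemes.
gain = relative_reduction(E["STF"], E["LTF_MFT"])

# Spread of E+ among the enhanced LTF variants, relative to the best one.
best = min(E_PLUS.values())
spread = (max(E_PLUS.values()) - best) / best

print(f"E reduction, STF -> LTF_MFT: {gain:.1%}")
print(f"E+ spread among enhanced LTF variants: {spread:.2%}")
```

The enhanced LTF variants lower E by about 20% relative to the STF-based schemes, while the E+ values of those variants differ from one another by only about 0.15%, consistent with the observation that the communication cost model has very little effect in this case.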

7. Discussion and Conclusions. In this paper, we first presented a task graph 
(TG) model to represent a single step of multi-block overset grid computational fluid 
dynamics (CFD) applications. The nodes of the TG correspond to the individual 
grids, while an edge between two nodes indicates overlapping grids. We then described 
a set of task assignment heuristics tailored to meet the load balance requirements 
of this class of CFD problems on high performance parallel and distributed systems. 
The heuristics were derived by first considering two basic criteria to assign tasks to 
processors: smallest task first (STF) and largest task first (LTF). These heuristics 
were then systematically enhanced by integrating the status of the processing units in 
terms of their minimum finish times or largest idle times. Finally, the heuristics were 
modified so that interprocessor communication costs would reflect the type of 
network being used. A synthetic TG containing 128 grids and 16 million grid points 
was used to study and compare the behavior of all the assignment schemes. A TG 
obtained from a real test case with eight million grid points and 41 grids was also 
analyzed and compared with measured data. 
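
As a rough illustration of the TG model summarized above, the sketch below stores one node per grid, weighted by its number of grid points, and one undirected edge per pair of overlapping grids. The class name `TaskGraph`, its methods, and the four-grid example are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    """Nodes are grids weighted by grid points; edges mark overlapping grids."""
    grid_points: dict                              # grid name -> grid points
    overlaps: set = field(default_factory=set)     # undirected edges

    def add_overlap(self, a, b):
        self.overlaps.add(frozenset((a, b)))

    def processor_loads(self, owner, n):
        """Total grid points assigned to each of the n processors."""
        loads = [0] * n
        for grid, points in self.grid_points.items():
            loads[owner[grid]] += points
        return loads

    def cut_edges(self, owner):
        """Overlaps whose two grids sit on different processors; these are
        the pairs that need interprocessor communication."""
        return sorted(tuple(sorted(edge)) for edge in self.overlaps
                      if len({owner[g] for g in edge}) > 1)

# Hypothetical four-grid example on two processors.
tg = TaskGraph({"A": 400, "B": 300, "C": 200, "D": 100})
for a, b in (("A", "B"), ("B", "C"), ("C", "D")):
    tg.add_overlap(a, b)

owner = {"A": 0, "B": 1, "C": 1, "D": 0}
print(tg.processor_loads(owner, 2))
print(tg.cut_edges(owner))
```

In this example the assignment balances the load at 500 grid points per processor, and two of the three overlaps (A-B and C-D) cross the processor boundary, so an assignment scheme that accounts for communication has to weigh both quantities.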

The work reported in this paper is targeted to CFD users and intended for even- 
tual development of dynamic assignment schemes with minimum overhead. Data 
exchanges are currently assumed to have uniform cost; however, a realistic prediction 
of communication must take into account the variety and location of the computing 
resources. Future enhancements will include a user-friendly iterative procedure to enable 
scientists and engineers to achieve optimal scalability across available resources. While 
the emphasis of this work is performance prediction, suitable partitioning schemes 
such as MeTiS [11] can be used to provide an initial grid partition. 